Welcome. The purpose of this site is to share the lore of designing
new processors and integrated systems-on-chips using FPGAs
(field-programmable gate arrays).
|
Ron Wilson, EE Times:
Avoidance proposed as solution to 90-nm problems. Very interesting.
"The notion that RTL must be a description of the wiring, not simply an
expression of the logic, recurred during the panel. It has also been
voiced frequently by design teams (not represented on the panel) that
are working with 130-nm designs. ..."
"The notion of the predesigned, configurable platform is beginning to get serious notice at 90 nm."
|
|
Happy new year (belated).
Embrace change
Anthony Cataldo, EE Times:
Altera to spin new FPGA for 90-nm production
Altera: Cyclone Devices ... Shipping Ahead of Schedule.
"With only 15 months from conception to shipment, the development of the Cyclone device family is the fastest in Altera's history."
Altera: ... Delivery of First Stratix GX Devices.
Now sampling.
Impressive. Congratulations. Execute, execute, execute.
Xilinx:
Enables Gibson Guitar's Best of Show Award.
I saw this at CES. A guitar with an ethernet jack.
"Gibson will offer MaGIC, an acronym for Media-accelerated Global Information Carrier, in every Gibson guitar within the next 12-18 months. ..."
"MaGIC uses state-of-the-art technology to provide up to 32 channels of
32-bit bi-directional high-fidelity audio with sample rates up to 192
kHz. Data and control can be transported 30 to 30,000 times faster than
MIDI."
Tom Hawkins of Launchbird Design Systems, Inc.,
announces Confluence 0.1.
"Confluence is a simple, yet amazingly powerful hardware design
language. Its flexibility and high level of expression reduces code
size and complexity of a design when compared with either Verilog or
VHDL. Confluence also enforces clean RTL preventing common errors and
bad design practices often introduced in traditional HDL coding."
"And unlike C based approaches, design engineers love Confluence because
it still feels like coding in HDL. The language is implicitly parallel
and very structural. ..."
"Confluence runs on Linux x86."
OK, but please let us know when you run on the volume platform.
Does Confluence employ OCaml? Interesting if so.
So far, details are sketchy, but welcome -- the more, the merrier.
Today's schedule of the
SDRForum Symposium on Use of Reconfigurable Logic in Software Defined Radios.
|
|
FPGA-FAQ has a nice fresh list of
FPGA boards.
Peter Clarke, Semiconductor Business News:
Former UK defense unit offers floating-point unit for FPGAs.
For MicroBlaze and the Virtex-II Pro's PowerPC(s).
QinetiQ [Quixilica].
'"We're already seeing applications in image and signal processing systems,
control, and support of legacy hardware, where the combination of an
FPGA with an embedded microprocessor core and the FPU can provide the
functionality and performance of an entire DSP subsystem," said Bill Smith,
manager of QinetiQ's real-time systems laboratory, in a statement.'
I've been to Malvern several times, lovely place.
|
|
Free Xilinx PicoBlaze Microcontroller Expands Support to Virtex-II Series FPGAs and CoolRunner-II CPLDs.
PicoBlaze User Resources.
Earlier
coverage.
Regarding PicoBlaze for CPLDs: since CoolRunner-II devices lack any on-chip
block RAM for instruction memory, the PicoBlaze for CoolRunner-II requires
that you provide an external 16-bit-wide instruction RAM. This may prove
prohibitive in board area and cost. You can reduce the requirement
to an 8-bit external memory using a few more macrocells, of course,
but in my opinion this application is a better fit for a device
with embedded block memory (e.g. Spartan-IIE, etc.).
This does illustrate the utility and value of a modest amount of
embedded RAM and/or FLASH in these larger CPLDs -- an idea whose time has come.
|
|
Xilinx:
Tarari adopts Xilinx Technology for Reconfigurable Content Processor Solutions.
"Tarari content processors are hardware and software-based subsystem
building blocks (silicon, boards, etc.) that snap into servers, appliances
and network devices, allowing for the first time the inspection of
application layer content at network speeds..."
Tarari.
Here, in March: Applications of racks full of FPGA multiprocessors:
"I suppose my pet hand-wavy application for these concept chip-MPs is
lexing and parsing XML and filtering that (and/or parse table construction
for same). Let me set the stage for you. "
"Imagine a future in which "web services" are ubiquitous -- the internet
has evolved into a true distributed operating system, a cloud offering
services to several billion connected devices. Imagine that the current
leading transport candidate for internet RPC, namely SOAP -- (Simple
Object Access Protocol, e.g. XML encoded RPC arguments and return values,
on an HTTP transport, with interfaces described in WSDL (itself based
upon XML Schema)) -- imagine SOAP indeed becomes the standard internet
RPC. That's a ton of XML flying around. You will want your routers and
firewalls, etc. of the future to filter, classify, route, etc. that XML
at wire speed. That's a ton of ASCII lexing, parsing, and filtering. It's
trivially parallelizable -- every second a thousand or a million separate
HTTP sessions flash past your ports -- and therefore potentially a nice
application for rack full of FPGAs, most FPGAs implementing a 100-way
parsing and classification multiprocessor."
|
|
Lauro Rizzatti, in EEdesign:
Gates, lies and common sense.
Rizzatti revisits the marketing gates
issue.
"Realistically, now there is a simple, practical way to compare the design capacity of two emulation solutions based on the Virtex-II components. By listing type and quantity of Virtex-II devices allocated to mapping the design-under-test, possibly augmented by one or more external memory banks, you can now truthfully and reliably evaluate two or more emulation systems."
Well, that's not very helpful. Far better is to simply
describe a capability vector of total resources; then you can compare
across families and across vendors.
The vector should include (#LUTs, tILO,
amount of each layer of the memory hierarchy, external RAM).
Thus a system with two XC2V6000-5's might be
(68 KLUT, 410 ps, 1056 Kb LUT RAM, 2.6 Mb BRAM, ?) * 2 =>
(135 KLUT, 410 ps, 2 Mb LUT RAM, 5.2 Mb BRAM, ?)
and a system with four EP1S60s might be something like
(57 KLUT, ? ps, 574 M512s, 292 M4096, 6 MegaRAM, ?) * 4 =>
(228 KLUT, ? ps, 1.1 Mb M512s, 4.6 Mb M4096s, 13.5 Mb MegaRAM, ?).
If your problem domain warrants it, by all means, grow the
capability vector to include multiplier resources, embedded
processors, high speed serial resources, etc.
Congratulations to Altera for simply naming their new parts
with the most important element of this capability vector, KLUTs.
See also these
two
articles.
|
|
Ch-ch-ch-changes
I have returned full time to the software world;
without discussing specifics, my aim is to significantly
improve the lives of software developers and software users alike.
Fear not, I anticipate that this site will continue to report upon
news, and muse aloud about ideas, in the FPGA CPU and SoC space.
However, expect the reports to be more sporadic, and any musings to be
less elaborate.
Thanks giving
To my wonderful family, thank you. How happy I am that we are here
together to share life's rich pageant.
Thanks to my friends.
I am so fortunate to share friendship with some most excellent kindred
spirits who are so generous with their time, regard, insights, kindness,
well wishes, and good cheer.
Special thanks to those several of you whom I am privileged to count
as close friends.
Thank you for being one in a thousand.
I thank and remember those who have gone before, who lived and
worked and fought and died to make the world a happier
place for this ungrateful entitlement generation.
Many of us here in the western world have never known
want, disease, hunger, strife, nor war in our backyard.
Let us remember those that still live with these hardships.
Apropos of this site, I also thank the vast legions of hard working
engineers and scientists, and their collected and focused embodiments
in corporations, for ceaselessly advancing the science and the processes
and the devices and the platforms and the tools and the infrastructure
so as to deliver, free, the miracle of modern programmable logic,
that empowers even the little guy to turn ideas into tangible hardware.
And I thank you, dear reader, for frequenting this site, warts and all.
New Xilinx Spartan-IIE devices -- like manna from heaven
In September, Altera announced
Cyclone,
and last November, Xilinx announced
Spartan-IIE. Back then I wrote,
"You might think that as Virtex-E is to Virtex, so is Spartan-IIE to Spartan-II."
"But you would be wrong. According to data sheets, whereas an
XCV200 has 14 BRAMs (56 Kb) and the XCV200E has 28 BRAMs (112 Kb),
in the Spartan-II/E family, both the XC2S200 and (alas) the XC2S200E
have the same 14 BRAMs (56 Kb)."
"If your work is "BRAM bound", as is my multiprocessor research, this is
a disappointment."
Now Xilinx announces two new, larger Spartan-IIE devices,
the XC2S400E and XC2S600E. And lo and behold, unlike the BRAM deficient
XC2S300E, the 2S400E and 2S600E have the same BRAM to LUT ratios as the
original V400E and V600E. Thanks Xilinx!
A good thing too, for otherwise these parts would be seriously
RAM poor vis-a-vis their Cyclone competition.
Xilinx: ... Extends World's Lowest Cost FPGA Product Line.
FAQ.
Data sheet (alas, no single PDF).
"In 2003, the company is on track to deliver a fifth generation of the
Spartan Series, reaching even higher densities at significantly lower
price points."
Here is the updated competitive landscape.
Since Xilinx is making a big noise about the greater number of I/Os
available with Spartan-IIE devices
(an observation first
noted
by Rick "rickman" Collins), I thought I would oblige them and add a
column for I/O.
(The concept of the Cyclone parts, as I understand it, is that the pad ring
limits determine the area of the device and hence the area for the
programmable logic fabric. So what, then, does the higher ratio of I/O
to logic in the Xilinx devices tell us?)
            BRAM               '02    '03    '04          '03
Device        Kb  KLUT   I/O   BAP    BAP    BAP   Ref   $/KLUT
XCS05XL        0   0.2    77   $2.5                [3]   $12.75
XC2S50E       32   1.5   182   $7                  [2]    $4.67
EP1C3         52   3     104          $7     $4    [1]    $2.33
EP1C6         80   6     185          $17    $9    [1]    $2.83
XC2S300E      64   6     329   $18                 [2]    $3.00
XC2S400E     160  10     410   $27                 [4]    $2.70
EP1C12       208  12     249          $35    $25   [1]    $2.92
XC2S600E     288  14     514   $45                 [4]    $3.26
EP1C20       256  20     301          $60    $40   [1]    $3.21
XC2V1000     640  10
EP1S10       752  11
EP1S20      1352  18
XC2V2000     896  22
BRAM Kb: Kbits of block RAM
(excludes parity bits, LUT RAM, and "M512s")
KLUTs: thousands of LUTs
I/O: maximum user I/O
BAP: approximate best announced price, any volume
$/KLUT: approximate 2003 BAP/KLUTs
References:
[1] Altera Cyclone Q&A:
"High-volume pricing (250,000 units) in 2004 for the EP1C3, EP1C6, EP1C12,
and EP1C20 devices in the smallest package and slowest speed grade will
start at $4, $8.95, $25, and $40, respectively. ...
Pricing for 50,000 units in mid-2003 for the EP1C3, EP1C6, EP1C12,
and EP1C20 devices in the smallest package and slowest speed grade will
start at $7, $17, $35, and $60, respectively."
[2] Xilinx Spartan-IIE press release:
"Second half 2002 pricing ranges from $6.95 for the XC2S50E- TQ144
(50,000 system gates) to $17.95 for the XC2S300E-PQ208 (300,000 system
gates) in volumes greater than 250,000 units."
[3] Xilinx Spartan press release:
"Spartan pricing ranges from $2.55 for the XCS05XL-VQ100 (5,000 system
gates) to $17.95 for the XC2S300E-PQ208 (300,000 system gates) in volumes
greater than 250,000 units."
[4] Xilinx 2nd Spartan-IIE press release:
"XC2S400E ... and XC2S600E ... and are priced at $27 and $45 respectively (250K volume)."
Other reports
Anthony Cataldo, EE Times:
Xilinx packs more I/O into its top-selling FPGA line.
Crista Souza, EBN:
Xilinx drives Spartan-IIE to high end.
Peter Clarke, Semiconductor Business News:
Xilinx adds two FPGAs to Spartan family.
(I think it is interesting to note that no one else picked
up on the much more generous servings of BRAM ports and bits
in the newer devices.)
What XC2S600E means to me
Please refer back to this piece
that sketches how in April '01 I PAR'd a multiprocessor of 12 clusters
of 5 processors in a single V600E, using 1 1/5 BRAMs per processor.
At the time, the V600E was not inexpensive.
Now with the advent of the XC2S600E, we can see practical and inexpensive
supercomputer scale meshes of simple processing elements implemented
completely and cost effectively in programmable logic.
At 60 processors per $45 device (in huge volumes), that works out to
just $0.75 per processing element. Loaded up with DRAM, this implies
a total component cost of ~$1.50/PE, and a density of about 20-40 processors
per square inch.
|
|
Xilinx: Free CoolRunner-II Design Kit.
"The Xilinx CoolRunner-II Design Kit is available free to qualified
customers through the Xilinx worldwide distributor base. The kit can also
be purchased direct from Xilinx for $49.99 through the online store ... "
The kit is apparently based upon the Digilab
XC2, and includes
an XC2C256-7TQ144.
Chris Edwards, EE Times:
300mm volume drive for Xilinx.
|
|
For those of you with Windows Media Player, take a look at
Jeff Bier of
BDTi's talk for Stanford
EE380,
Comparing FPGAs and DSPs for Embedded Signal Processing (ASX).
Highly recommended.
Slides.
"Conclusions: High-end FPGAs can wallop DSPs on computation-intensive,
highly parallelizable tasks ..."
Joel on Software:
The Law of Leaky Abstractions.
A new book by Henry S. Warren, Jr.:
Hacker's Delight,
is chock full of arcane bit twiddling tricks and folklore.
If you're the kind of person that knows what
((w-0x01010101)&~w&0x80808080) != 0
is good for, you'll love this book.
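(If you don't -- here is a hedged little sketch of my own, not an excerpt from
the book: the expression is the classic test for a zero byte somewhere in a
32-bit word, handy for scanning strings a word at a time.)

/* Nonzero iff some byte of w is 0x00.  Subtracting 0x01 from each byte
   leaves the top bit set in the (lowest) byte that was zero; the & ~w
   term discards bytes whose top bit was already set in w. */
#include <stdint.h>
#include <stdio.h>

static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

int main(void)
{
    printf("%d\n", has_zero_byte(0x41424344u));  /* "ABCD" -> prints 0 */
    printf("%d\n", has_zero_byte(0x41420044u));  /* zero byte -> prints 1 */
    return 0;
}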
I picked up a copy at the OOPSLA bookstore, but over the weekend
I saw it at the Bellevue, WA, Barnes and Noble -- filed in the
Computer Security section, of course. I told the computer books clerk
that the book was misfiled, but I was told there was little that could
be done about it -- book shelving assignments come from on high.
"Finally, I should mention that the term 'hacker' in the title is meant in
the original sense of an aficionado of computers ... If you're looking
for tips on how to break into someone else's computer, you won't find
them here." -- Preface
|
|
The old ways hold us back
A long essay today: further reflections on OOPSLA, languages, and computer
architecture.
One of the emergent themes of this year's OOPSLA,
perhaps stirred up by James Noble and Robert Biddle's
Onward! track paper,
Notes on Postmodern Programming,
was some sober reflection on where object-oriented programming was headed,
and where it has actually gone.
The future (i.e. now) is not what some thought it would be
(e.g. the ubiquity of very reflective and malleable and immediate
environments such as Smalltalk-80 and Self).
Instead, C++ has won (or perhaps the winner is Visual Basic),
with Java and C# catching up.
(I am unconvinced that languages and environments where a major
tenet of reuse is implementation inheritance, and where there is
poor support for arms-length-composition of separately authored,
versioned, and deployed software components, ever stood a serious chance of
scaling up into ubiquity, but that's another diatribe, for another day.)
(Incidentally, Noble said that when this
Slashdot
thread ran,
the resulting internet traffic to fetch his paper slashdotted
all of New Zealand!)
Several prominent attendees expressed the sentiment that our languages
and platforms have been shaped by historical constraints that no
longer apply. But their legacy lives on, and perhaps, holds us back.
(Intel 386 marketing slogan: "Extended the Legacy of Leadership.")
I don't buy much of that, by the way. (I know too many grandmas
who do stunning things with their computers.)
But the fact remains that C and even C++ have seen their day,
and in many domains, it is time to let go of them, and move on.
Software archaeology uncovers the dominant paradigm
First let's do a gedanken experiment. Take your Windows PC, or your
OS X Macintosh, or your Linux box, and freeze it in mid-computation,
and save a snapshot of the entire memory image. Maybe you have
a 128 MB dump. Now you put on your "software archaeologist" pith helmet, and
you spend the next three years of your life pouring over the dump, picking
through the strata, cataloging the arcane bits of code and data that you uncover.
If you do that, I assure you that you will find, amongst other things, hundreds
or even thousands of separate, mediocre, linked list and hash table
implementations. You will find tens of thousands of loops
written in C/C++ that amount to
for (ListElement* p = head; p; p = p->next) { ... }
That is the dominant paradigm. It stinks. It is holding us back.
This little idiom and its brethren, carved in diamond in
untold billions of dollars' worth of genuinely useful software intellectual
property, is the carrot that leads Intel and AMD and the rest
to build 100 million and soon one billion transistor processors
where the actual computing takes place in less than
a million of those transistors.
What is the intention behind this code? To do something with a
growable collection of values -- search it, map it, collect a subset.
On modern machines, this code stinks. A full cache miss wastes
many hundreds of instruction issue slots. Your hundreds of millions
of transistors sit around twiddling their thumbs. Until each
preceding pointer completes its odyssey...
(Upon determining that the miscreant pointer value has gone AWOL
from on-chip caches, the processor sends out a subpoena compelling its
appearance; this writ is telegraphed out to the north bridge and thence to
the DRAM; the value, and its collaborators in adjacent cells in the line,
then wend their way, from DRAM cells, through sense amps, muxes, drivers,
queueing for embarkation at the DRAM D/Q pins, then sailing across the PCB,
then by dogsled across the north bridge, then by steamer across the
PCB again, finally landing at the processor pins, and then, dashing
across the processor die and into the waiting room of the L1 D-cache ...)
... until that happens, the poor processor can't make any significant
progress on the next iteration of the computation.
This scenario is so bad and so common that the microprocessor vendors
use 80% of their transistor budgets for on-chip caches --
Intel as glorified SRAM vendor.
Unfortunately it is really hard to make compilers and computer
architectures transform this hopelessly serial pointer-following
problem into something that can go faster.
The Sapir-Whorf hypothesis
The tragedy is that the intention behind the code -- do something
with a collection of values -- can often be realized in parallel,
in O(1) or O(lg n) time.
But the language and idiom of the dominant paradigm mould programmers'
thoughts and actions, and they produce this serial pointer-following junk.
Tim Budd, An Introduction to Object-Oriented Programming:
"Sapir and Whorf went further, and claimed that there were thoughts
one could have in one language that could not ever occur, could not
even be explained, to somebody thinking in a different language. This
stronger form is what is known as the Sapir-Whorf hypothesis, and remains
controversial. It is interesting to examine both of these forms in the
area of artificial computer languages."
If, as a C programmer, all you've ever seen or been taught is that
you make variable-sized collections using linked lists, and that
you traverse them using pointer following, then it follows that all
you're ever going to write is the same deathly serial
for loop we saw above.
But you could do better. You could have learned the method
of data abstraction, and reused an abstract data type (ADT) library
to implement your collection.
Then you might have called some of
ListContains(list, element);
ListMap(list, pfnMapElement);
ListSelect(list, pfnPredicate);
That would be a great step forward, because then the implementation
of the list collection could evolve without modifying all
the client code. You could hire the world's expert on growable list
collections, and her enhancements could benefit all clients of the ADT.
Over time, you could also benefit from new data structure and algorithm
discoveries, such as
skip lists.
And over the years, as machine architectures change, the implementation
could be retuned. For example, instead of a linked list of nodes,
even today's cache oriented scalar machines would benefit from
a new list structure that clusters nodes together into a cache line,
perhaps by making multi-element supernodes, or by employing a specialized
memory allocator. Similar attention to page locality could also pay off
handsomely.
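Here's a hedged sketch of that kind of representation change (the SuperNode
name and sizes are mine, purely illustrative): behind the same ListMap
interface, the library clusters several elements per node, so each cache-line
fill delivers a dozen elements instead of one node and a pointer.

/* "Unrolled" list node: many elements per node for cache-line locality.
   Clients still call ListMap(); only the library knows the layout. */
#define ELEMS_PER_NODE 14            /* sized so a node spans ~1-2 cache lines */

typedef struct SuperNode {
    struct SuperNode *next;
    int count;                       /* elements in use in this node */
    int elems[ELEMS_PER_NODE];
} SuperNode;

void ListMap(SuperNode *head, void (*pfnMapElement)(int))
{
    SuperNode *p;
    int i;
    for (p = head; p; p = p->next)        /* one potential miss per node...  */
        for (i = 0; i < p->count; i++)    /* ...amortized over many elements */
            pfnMapElement(p->elems[i]);
}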
Of course, few C programmers write code this way,
and no standard C collection classes have been adopted by the industry.
Not the dominant paradigm: Scheme and Smalltalk-80
The C programming community rarely takes notice of the
lessons of the LISP, Scheme, and Smalltalk-80 communities.
Scheme is a beautiful language. A small, clean, lexically scoped Lisp,
multiparadigm (you can write pure functional, imperative,
and object-oriented programming styles, amongst others), Scheme provides
powerful abstraction-building facilities, including lambdas (unnamed
functions as values), higher order functions (functions on functions),
closures, and continuations.
(As a dyed-in-the-wool C/C++ hacker, I wish I were fully fluent in Scheme.)
My first exposure to closures was in Smalltalk-80. Smalltalk
has a construction called a block. It is an anonymous
function of 0 or more arguments that you can directly use in expressions.
The body of a Smalltalk block has direct access to its enclosing
blocks' and method's variables.
And significantly, the block is a first class object, that you
can squirrel away in a data structure and call later, and again, and again.
Using blocks, the designers of Smalltalk built a set of powerful
collection class facilities that provided not only data abstraction,
but also control abstraction.
For example, in Smalltalk, you can write:
us_residents <- addresses select: [:each | each country == #USA]
meaning, take the collection of Address objects, and a predicate
which determines whether each address' country is USA,
and return a new collection with just those addresses whose country
field is USA.
(In Smalltalk speak, we would explain the above as:
"send the addresses collection a message select: with an
argument block that takes one argument each; to determine
the value of the block, send each the country message;
send the response to that, the == message with an argument
being the Symbol #USA, and return the response as the value
of the block.")
Here we didn't specify how to iterate over the collection,
that is left abstract and up to the collection class.
That's control abstraction.
In particular, it might be possible to concurrently evaluate the predicate
for each element of the collection (in parallel).
(Strictly speaking, in Smalltalk-80, there was an implicit understanding
that this computation would proceed in a serial fashion, but the
point remains.)
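To make that point concrete, here is a sketch of my own in plain C (the
ArraySelect name, the two-phase structure, and the OpenMP pragma are all just
illustrative assumptions, not anything from Smalltalk or a real library):
because select-style interfaces never promise an iteration order, a library is
free to evaluate the predicate over all elements in parallel and compact the
survivors afterwards.

/* Hypothetical ArraySelect: phase 1 evaluates pred() over every element
   independently (and so could run in parallel); phase 2 serially packs
   the survivors into out[], preserving order.  Returns the count kept. */
#include <stdlib.h>

size_t ArraySelect(const int *in, size_t n, int *out, int (*pred)(int))
{
    char *keep = malloc(n);
    if (!keep) return 0;

    #pragma omp parallel for          /* iterations are independent */
    for (long i = 0; i < (long)n; i++)
        keep[i] = (char)(pred(in[i]) != 0);

    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (keep[i])
            out[m++] = in[i];

    free(keep);
    return m;
}

Note, though, that a C function pointer sees only its arguments; passing along
a minBalance-style context takes a void* side channel or a global, which is
precisely the friction that blocks (and the anonymous methods discussed below)
remove.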
Enter C++
C++ came on the scene in the late 1980s, and the mainstream moved
from C to C++ by 1995 or so.
C++'s built-in support for data abstraction and object-oriented programming
made the practice of using ADTs more commonplace. By the late 1990s, the
Standard Template Library
promised to deliver powerful, efficient, reusable standardized collection
classes to C++ programmers.
(STL's use of templates can be rather obscure. But for real
mind-blowing use of template classes with template parameters that
are themselves template classes, and with template member functions,
and specializations, and macros, oh my! -- check out the
Boost Libraries.)
Now I have not used STL much -- I used it in some prototypes of
CNets2000 -- and it seems like a big step forward over not-invented-here
roll-your-own collection classes that do not compose with each
other -- but it is very iterator-centric.
To the extent that programmers use iterators, compilers are obliged
to generate code that exhibits the observable semantics. Serial semantics.
Still, we are making progress here. We can certainly evolve
the implementation of our STL classes over time without having
to modify the vast client code base.
Enter C#
C# has been shipping in Visual Studio.NET since this past February.
It's a nice language. In my experience C# -- and its environment,
the .NET Framework class libraries -- offer dramatically improved
programmer productivity as compared to C/C++.
C# programmers have access to a rich set of collection class libraries.
When you use these libraries in your code, not only do you benefit
from not having to maintain the library code yourself, but you can
arguably expect it to improve over time.
Now last Thursday, at OOPSLA, Anders Hejlsberg gave a keynote address
on the present design of C#, and on four proposed new C# language features
(more on those in the next section).
As he was explaining the many nice convenience features of C#,
(none of which add the horrible orthogonal-complexity problems that C++ had),
Hejlsberg came to foreach
and the IEnumerable and IEnumerator interfaces. In C#, if your class
implements interface IEnumerable, you can use it in a foreach loop:
using System;
using System.Collections;

class Range : IEnumerable { ... }

class Main {
    public static void Main(string[] args) {
        foreach (int i in new Range(10)) { ... }
    }
}
Now when I first saw C# enumerators I thought "not bad, but they
don't provide control abstraction or the opportunity to go more
parallel over time -- pity".
Anonymous methods, or why the future looks bright indeed
Well, the talk just got better and better as Hejlsberg covered four
new language features under consideration for C#. These include
generics (parametric types), iterators, anonymous methods, and
partial types.
New C# language features page.
Hejlsberg's
slides (PPT).
Now let us focus on the proposed anonymous methods, which seem
just like lambda functions with closure semantics.
When/if C# provides them, then it will be convenient and
natural to write methods such as the earlier client of
Collection>>select:.
Hejlsberg's example:
delegate bool Filter(object obj);

public class ArrayList {
    public ArrayList Select(Filter matches) {
        ArrayList result = new ArrayList();
        foreach (object obj in this) {
            if (matches(obj)) result.Add(obj);
        }
        return result;
    }
}
(Here matches is the predicate function that determines
whether to add each element to the result collection.)
The example continues with an application of Select():
public class Bank {
    ArrayList accounts;
    ArrayList GetLargeAccounts(double minBalance) {
        return accounts.Select(
            new Filter(a) {
                return ((Account)a).Balance >= minBalance;
            });
    }
}
Beautiful! Control abstraction, with no visible "serial iteration" lines!
Note here that the above anonymous method is able to reference
the local variable minBalance from wherever it is called.
For reference, here's the same thing in Smalltalk:
getLargeAccounts: minBalance
    ^accounts select: [:each | each balance >= minBalance]
Implications for mainstream computer architecture
If you partake in the Kool-Aid ...
"... we are moving to a world that there are basically two places that code runs -- the JVM and the CLR. ..."
... you can see where this is going.
First, both the .NET Framework, and the Java platform,
embrace multithreaded programming and make it more manageable.
Programmers are going to be more familiar and more comfortable
with concurrency.
(See .NET Asynchronous Programming
and/or the Asynchronous Method Invocation section in
Don Box's Essential .NET Volume 1.)
Second, with this proposed anonymous method facility, C# programmers
might find it much more convenient, natural even, to write code
that is implicitly parallelism friendly.
These new developments may be just the thing to break the
chicken-and-egg deadlock on chip multiprocessors.
Once we have a body of important commercial software that can demonstrably
take advantage of 4, 16, or 64 processors on a chip, then
can we get back to riding a steep performance growth curve that is
otherwise prone to level out.
I will be very disappointed if, ten years from now, the best use of
a multibillion transistor substrate is a four or eight way chip multiprocessor.
We can do much better than that, if we evolve our language and idioms
and embrace parallelism.
"Parallelism: the new imperative." -- Jim Gray, Microsoft.
"Notation is a tool of thought." -- Ken Iverson.
"I don't know who discovered water, but it wasn't a fish." -- Marshall McLuhan.
|
|
All your bits and bobs
Yesterday, I attended an OOPSLA tutorial on Rotor Internals.
Microsoft has just released
Rotor 1.0,
their shared source common language infrastructure.
Of late I've been exploring the last
Rotor beta; it's pretty interesting -- and vast.
Drinking the Kool-Aid:
"There's a flood coming and its going to wash away people who don't make this change. ... if you talk to people who really look at the trends in this industry ... it feels ... we are moving to a world that there are basically two places that code runs -- the JVM and the CLR. ..."
To be clear, this is not referring to embedded systems development. Yet.
Altera White Paper: Delivering RISC Processors in an FPGA for $2.00.
John Kent has some
FPGA CPU and
system
experiments.
Loarant's AX1610.
16-bit RISC in ~360 slices in Virtex-derivative architectures.
(C compiler?)
Why do we build scalar RISC processors? Because most software intellectual
property is entombed in dusty deck C/C++.
Bernd Paysan:
b16 A Forth Processor in an FPGA.
"Flex10K30E: About 600 LCs, the unit for logic cells in
Altera. The logic to interface with the eval board
needs another 100 LCs. The slowest model runs at up
to 25MHz."
Like the gr00x0, it's a "literate
Verilog design", i.e. the write-up is the source. I like the way he
uses *WEB to permit arbitrary order of presentation of the source.
|
|
Altera:
Stratix GX Devices: Altera Integrates 3.125-Gbps Transceivers with World's Fastest FPGA.
Altera's Stratix GX's 3.125 Gbps transceivers are a big step up from their
Mercury family, and Stratix GX joins Virtex-II Pro in the ranks of fast,
large FPGA fabrics with integrated 3.125 Gbps serial transceivers.
Architecture.
Overview.
Data sheet.
Q&A.
Stratix GX Devices & Nios Processor.
"The Stratix GX device's innovative MultiTrack interconnect structure improves overall system performance of the Nios processor to over 150 MHz."
What does that mean? Simply put, will Nios run at up to one instruction
per clock at 150 MHz in Stratix GX?
Stratix & Stratix GX Device Architectural Differences. As you can see, the largest Stratix GX devices offer only ~1/3 the LUTs of the largest Stratix devices.
This is in contrast with the Xilinx strategy, where the largest
Virtex-II Pro device (2VP125, ~111000 LUTs) is comparable to the largest
Virtex-II device (2V10000, ~123000 LUTs).
Some transceiver differentiators:
-
"At 75mW per channel and only 450mW per gigabit transceiver block, Stratix GX transceivers consume less than half the power of competing FPGA solutions."
-
"Dedicated circuitry for XAUI and SONET."
-
Hard dynamic phase alignment: "DPA simplifies high-speed board design and layout through the automatic elimination of skew introduced by unmatched trace lengths, jitter, and other skew-inducing effects".
(Note, I'm just quoting the press release here, I do not have enough
insight into the problem to tell whether these dedicated features
truly offer a better solution (shorter time to market, easier to design)
than one fashioned out of programmable logic.)
Naive question:
Will all these emerging 3+ Gbps serial transceivers directly interoperate?
Anthony Cataldo, EE Times:
FPGA vendors position for serial I/O battle.
Anthony Cataldo, EE Times:
Lattice FPGA integrates 3.7-Gbit/s serdes transceiver.
|
|
Next week I'll be at OOPSLA'02,
with a brief excursion to the Bay Area on Wednesday.
Maybe I'll see you there.
|
|
Proactive service packs
- Today I received an email notification from Xilinx that 5.1i SP2
is available for download.
- Once you visit the Updates Center,
you discover "SP3 is scheduled for release December 11, 2002".
Nice work.
Xilinx's system builder IDE
Xilinx:
Xilinx Delivers New ISE Embedded Development Kit for the Fastest FPGA Processor Solution in the Industry.
Embedded Development Kit.
EDK IP Cores.
Note that some cores are "Additional High Value" cores.
Xilinx makes good on their
earlier ISE 5.1i statement
to roll out additional tools by year end.
Welcome, Xilinx Platform Studio, and System Generator for Processors
(two separate tools?),
to the system
builder space.
The competition will be good for everyone.
Again, the question for (from) third party IP providers: how to
package up our IP to make it available to customers
composing systems with these system builder IDEs?
Xilinx's co-design platform
Xilinx Expands Programmable Systems Solution with Groundbreaking Co-Design Technology.
"The new technology expands the company's current solution for
programmable systems by enabling customers to define an entire system in
ANSI-C to obtain the most optimal implementation by rapidly partitioning
and repartitioning between hardware and software. ..."
"The technology ... a library of hardware and software components,
called Processing Elements (PE) optimized for particular functions. This
capability enables the customer to use a best-in-class and domain-specific
tool to create an optimized PE. A re-partition is a compile time
switch, which enables one to profile, convert to a hardware/software
implementation and debug in a matter of minutes rather than days or
weeks. The hardware and software PEs come from a variety of sources,
including Xilinx and third-party AllianceEDA, AllianceCORE, and Embedded
Tools partners. ..."
"Commercial release of the tools is expected in mid-2003."
Earlier teaser press release.
This is an important development and is distinct from the system builder IDE
discussion. For large systems, composed of a great many function blocks,
it is important to have a way to explore the system partitioning --
what functions or tasks can migrate to software, what functions to hardware,
how many CPUs do I drop into this system, should I use a hardware, software,
or hybrid (software with coprocessor or special-purpose instructions
and function units)? For time to market reasons, amongst others,
you need a platform that lets both your software developers and your hardware
designers get cracking on the problem even as the "system architect" is still
getting a grip on the entire solution space. This could be a platform
that lets you make these trade-offs late in the development cycle;
a platform that lets you make derivative products with different
trade-offs without starting over from scratch.
Observe that Xilinx is partnering with many of the leading EDA
vendors, who have considerable experience with this problem in
the ASIC space.
This will probably be a complex product, an expensive product, and
an enabling product for high end designs, but will probably
be of little interest to designers of relatively simple embedded
systems.
This kind of product seems imperative for the FPGA vendors,
who need to make it easier (or at least make it possible) for
their customers to quickly come to market with designs that
exploit the vendors' newest, largest, highest margin devices.
It seems challenging to design and build a usable, coherent, and
effective tool in the face of multi-organizational and multi-disciplinary
considerations. (Several of Xilinx's partners are themselves competitors...)
And it will be interesting to see if and how these two sets
of products (system builder IDEs and co-design/architectural synthesis
platforms) will be integrated, and will interop.
As I wrote earlier,
"The end of monolithic microprocessors
The keynote speech by Henry Samueli of Broadcom aptly demonstrated that
we have left the era of monolithic microprocessors (except perhaps for
personal computers) and are now working into the era of highly integrated
systems. Take for example, the Broadcom
BCM3351
(PDF) VoIP Broadband Gateway, which integrates a myriad functions into a
single chip. Oh yes, by the way, there is a tiny MIPS core down in there,
somewhere."
That's a good example of a complex SoC that needs a hardware/software
co-design environment.
What marketing copy writers hear
Recalling the old Far Side comic strip,
What Dogs Hear:
"blah blah GROUNDBREAKING blah blah THE MOST OPTIMAL blah blah
blah blah ALL-ENCOMPASSING blah blah FULLY EXPLOIT blah blah
UNIQUE blah blah BEST-IN-CLASS blah blah
MORE THAN 50 PERCENT MARKET SHARE blah blah
DE FACTO STANDARD METHODOLOGY blah blah"
We know technology leadership when we see it, so it is
unnecessary to beat us over the head.
Other coverage
Michael Santarini, EE Times:
Xilinx pitches kit to embedded, software engineers.
"The announcements put Xilinx closer to its goal of making FPGAs available
to the embedded-design market and, ultimately, to software engineers,
said Per Holmberg, director of programmable-systems marketing. Combined,
he said, those sectors represent hundreds of thousands of potential new
users for programmable-logic devices."
Great quote
Anthony Cataldo, EE Times:
QuickLogic puts hard cores into its FPGAs.
'"Xilinx has convinced the world that a lemon called volatility is lemonade called reprogrammability," Hart said.'
We appreciate reconfigurability -- but we also appreciate convenient,
secure, low-cost non-volatile configuration solutions.
See e.g. the new Lattice
ispXPGA
with integrated EEPROM configuration memory, and the new Altera
Cyclone configuration devices ("Each configuration device costs
on average 10 percent of its corresponding Cyclone device").
|
|
Xilinx demonstrates practical application of FPGA partial reconfiguration
Mike Butts brought to my attention that the new Xilinx Crossbar Switch
uses partial reconfiguration of Virtex-II CLB switch matrices to build
a remarkably large crossbar.
Xilinx Announces Industry's First Programmable Crossbar Switch Solution.
Crossbar Switch.
Partial Reconfiguration for the Crossbar Switch.
"Switching is achieved by dynamic partial reconfiguration through the
FPGA configuration interface. By this mechanism, one or more changes to
the input-output mapping can be made in less than 280 microseconds even
in the largest FPGA devices."
White paper (registration required).
The white paper describes the design of a 928x928 crossbar that runs at 155 MHz.
Were it built using 928 copies of a 928:1 mux built out of LUTs, it would require ~232,000 slices (464,000 LUTs).
However, this crossbar switch very cleverly uses the Virtex-II CLB
switch box itself to implement a 33x8 crossbar tile, so the
entire 928x928 crossbar requires only 58x58 CLBs (including apparently
~27000 pass-through LUTs), apparently a factor of 17X denser
than could be accomplished using muxes built out of LUTs themselves.
The design flow is unusual, using XDL to build the initial design
and JBits to configure the switches.
Not being a big user of big crossbar switches myself :-), I cannot tell
whether this system is practical and attractive for commercial use, or
whether the peculiar design flow, or the best-case latency to change one or
more switches in a column (minimum of 240-280 us), is prohibitive.
I also wonder what the latency is to make random (not precomputed)
switch changes. That is, if I suddenly want out[765]=in[321],
how long does it take to compute the new reconfiguration
bitstream frames before downloading them? Is that included in the
240-280 us?
P4 vs FPGA MPSoC
On the fpga-cpu list, John Campbell
asked:
"A CPU programmed in a FPGA is always going to be handicapped
in clock speed relative to a conventional microprocessor.
Whats the best we can do currently? 50MHz or so ? Pretty
dismal against 2GHz for a current high end pentium."
My reply:
The high end Pentium 4 approaches 3 GHz now. The ALUs are double pumped,
so each one can do up to 6 Gops. There are two such ALUs, plus
a third "slow" ALU. Since the pipeline can issue at most 3 uops per cycle,
we have a maximum throughput of 3*3GHz = 9 Gops.
In practice, you won't see anything like that, because of branch
mispredicts, cache misses, and other perils. For example,
for a single cache miss that goes all the way out to main memory and is
an activated-page-miss in the DRAM, the latency could easily be 100 ns.
That's 100000 ps / 333 ps = 300 clock cycles or nearly a thousand
potential uop issue slots. They don't call it the "memory wall" for
nothing.
A high end FPGA CPU is only ~150 MHz. But you can multiply instantiate
them. I have an unfinished 16-bit design in 4x8 V-II CLBs that does
about 167 MHz and includes a pipelined single-cycle multiply-accumulate.
You should be able to put 40 of them in a 2V1000 for a peak 16-bit computation rate
(never to exceed) of 333 Mops * 40 = ~12 Gops.
(Actually, somewhat less. In an
earlier MPSoC exploration I found
that you must derate the single PE performance by 10-20% when you make
a fabric of them, because the tools let routing spill over into adjacent
PE tiles, even if the logic itself is fully floorplanned.)
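(Just to make the arithmetic explicit, here is the same back-of-envelope
calculation in code; every constant is taken from the discussion above, and
nothing new is being claimed.)

/* Back-of-envelope figures from the discussion above. */
#include <stdio.h>

int main(void)
{
    double p4_gops     = 3.0 * 3;            /* ~3 GHz x 3 uops/cycle ~= 9 Gops   */
    double miss_cycles = 100000.0 / 333.0;   /* 100 ns miss / 333 ps cycle ~= 300 */
    double lost_slots  = miss_cycles * 3;    /* ~900 issue slots per miss         */

    /* ~40 16-bit PEs in a 2V1000 at a round ~150 MHz, MAC = 2 ops/cycle
       (the post quotes 333 Mops for the 167 MHz design), before the
       10-20% routing derate noted above. */
    double mesh_gops = 40 * 0.150 * 2;       /* ~12 Gops peak */

    printf("P4 ~%.0f Gops peak; ~%.0f cycles (~%.0f slots) per miss; "
           "2V1000 mesh ~%.0f Gops peak\n",
           p4_gops, miss_cycles, lost_slots, mesh_gops);
    return 0;
}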
In a monster 2VP100 or
2VP125 you're looking at up to 10X that -- perhaps 50 Gmacs (100 Gops).
(Whether your problem can exploit that degree of parallelism, or whether
the part can handle the power dissipation of such a design, I just don't
know.)
When the Pentium 4 goes to main memory, it takes 50-150 ns. When the
FPGA CPU multiprocessor goes to main memory, it also takes 50-150 ns. If
the problem doesn't fit in cache, the P4 does not look so good.
Each P4 offers (with the help of a northbridge chipset) external
bandwidth of 3.2 GB/s (64 bits at 100 MHz, quad-pumped). Each 2V1000
offers external bandwidth of at least 8 GB/s (e.g. go configure yourself
four 133-MHz 64-bit (~105-pin) DDR-DRAM channels).
When the Pentium 4 mispredicts a branch, it takes many, many (up to ~20)
cycles to recover. When the FPGA CPU core takes a branch (or not),
it wastes 0 or 1 cycles. If you are spending cycles parsing text,
the random nature of the data can eliminate many of the benefits of a
deeeeeeeeeeeeeeeeeep pipeline.
If I had to run Office, I'd rather have a P4.
If I had to classify XML data on the wire at wire speed, I'd rather have
an FPGA MPSoC or a mesh of same.
I think most of you will enjoy this
lecture.
|
|
Graham Seaman of (the excellent) Open Collector, in
EEdesign:
Open-source cores provide new paths to SoCs.
Graham interviews Rudolf Usselmann of
ASICS.ws and
OpenCores.
"... So far we have ended up providing many, many hours of free tech
support which almost crippled our company. Now we've started sending
out friendly replies asking people to pay us for any support they might
require."
Clive Maxfield, EEdesign:
Reconfiguring chip design.
More on the QuickSilver Adaptive Computing Machine.
"The solution is to use a heterogeneous architecture that fully addresses the heterogeneous nature of the algorithms it is required to implement. ..."
"... a scalar node can be used to execute legacy code ..."
QuickCores
QuickCores' Press Release:
QuickCores Announces MUSKETEER IP Delivery System (PDF).
Targets Actel's
ProASICPLUS FLASH-based, non-volatile, reprogrammable FPGAs.
"Re-programmable ASIC on a postage stamp features built-in JTAG real-time debugger, built-in boundary scan controller, built-in device programmer, and downloadable microcontroller IP.":
"Micros to Download?"
"Yes, times have changed. With the MUSKETEER, you can simply download your microcontroller from your favorite IP provider's web site and program it into the MUSKETEER as it downloads."
QuickCores' downloadable microcontroller cores.
Products. Single-unit
prices from $175.00.
|
|
1, 2, 3, and so on
Oh my gosh --
this site is the #1 FPGA search result from
Altavista,
#2 from AllTheWeb, and
#3 from Google!
To spiders and readers alike, thanks for visiting!
Another homebrew CPU
Bill Buzbee:
Magic-1 Homebrew Computer.
This terrific web site describes Mr. Buzbee's endeavor to
build a new, microcoded, 16-bit (22-bit real address) processor,
from scratch, wire-wrapped, in good old TTL (not FPGAs).
Besides the hardware, the project also includes an lcc port
and a microcode simulator.
I encourage you to explore and appreciate
the many interesting pages, particularly the
diary.
|
FPGA CPU News, Vol. 3, No. 10
Back issues:
Vol. 3 (2002): Jan Feb Mar Apr May Jun Jul Aug Sep;
Vol. 2 (2001): Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec;
Vol. 1 (2000): Apr Aug Sep Oct Nov Dec.
Opinions expressed herein are those of Jan Gray, President, Gray Research LLC.
Powered by bash and sed,
inspired by Scripting News.
|